Further Optimal Regret Bounds for Thompson Sampling
Authors
Abstract
The second-to-last inequality follows from the observation that the event E_i(t) was defined as μ̂_i(t) > x_i. At time τ_{k+1} for k ≥ 1, μ̂_i(τ_{k+1}) = S_i(τ_{k+1})/(k+1) ≤ S_i(τ_{k+1})/k, where the latter is simply the average of the outcomes observed from k i.i.d. plays of arm i, each of which is a Bernoulli trial with mean μ_i. Using the Chernoff-Hoeffding bound (Fact 1), we obtain that Pr(μ̂_i(τ_{k+1}) > x_i) ≤ Pr(S_i(τ_{k+1})/k > x_i) ≤ e^{−k d(x_i, μ_i)}.
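The bound applied above can be checked numerically: for k i.i.d. Bernoulli(μ) samples, the probability that their average exceeds x > μ is at most e^{−k d(x, μ)}, where d is the Bernoulli KL divergence. Below is a minimal Monte Carlo sketch of that comparison; the function names and the particular values μ = 0.5, x = 0.7, k = 50 are illustrative choices, not taken from the paper.

```python
import math
import random

def kl_bernoulli(x, mu):
    """KL divergence d(x, mu) between Bernoulli(x) and Bernoulli(mu) distributions."""
    return x * math.log(x / mu) + (1 - x) * math.log((1 - x) / (1 - mu))

def chernoff_bound(k, x, mu):
    """Chernoff-Hoeffding upper bound on Pr(mean of k Bernoulli(mu) samples > x), for x > mu."""
    return math.exp(-k * kl_bernoulli(x, mu))

def empirical_tail(k, x, mu, trials=20000, seed=0):
    """Monte Carlo estimate of Pr(mean of k Bernoulli(mu) samples > x)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < mu for _ in range(k))
        if successes / k > x:
            hits += 1
    return hits / trials

mu, x, k = 0.5, 0.7, 50
emp = empirical_tail(k, x, mu)
bound = chernoff_bound(k, x, mu)
```

With these values the exponential bound is roughly 0.016, comfortably above the (much smaller) empirical tail probability, as the inequality requires.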
Similar Resources
A Near-optimal Regret Bounds for Thompson Sampling
Thompson Sampling (TS) is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that it has favorable empirical performance compared to state-of-the-art methods. In this paper, a novel and almost tight martingale-based regret analysis for Thompson ...
Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits
I prove near-optimal frequentist regret guarantees for the finite-horizon Gittins index strategy for multi-armed bandits with Gaussian noise and prior. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss computational issues and present experimental results suggesting that a particular version of the Git...
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
The multi-armed bandit problem is a popular model for studying exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to pla...
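The basic idea described above, choosing arms according to posterior beliefs, can be sketched for the Bernoulli bandit as Beta-Bernoulli Thompson Sampling: each round, sample a mean estimate from each arm's Beta posterior and play the arm with the largest sample. This is a minimal illustrative sketch, not the paper's exact formulation; the function name and arm means are assumptions.

```python
import random

def thompson_sampling(means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling on arms whose true success probabilities are `means`."""
    rng = random.Random(seed)
    n_arms = len(means)
    # Beta(1, 1) uniform priors, stored as (successes + 1, failures + 1) per arm.
    wins = [1] * n_arms
    losses = [1] * n_arms
    total_reward = 0
    for _ in range(horizon):
        # Draw one sample from each arm's posterior and play the argmax.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0
        total_reward += reward
        # Update the played arm's posterior with the observed Bernoulli outcome.
        if reward:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return total_reward

pulls = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
```

Over 2000 rounds the algorithm concentrates its play on the 0.8 arm, so the accumulated reward approaches 0.8 per round rather than the 0.5 average of the arms.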
An Information-Theoretic Analysis of Thompson Sampling
We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and ...